Abstract

Recent research efforts have shown the benefits of integrating functional and data
parallelism over using either pure data parallelism or pure functional parallelism. The
work in this paper presents a theoretical framework for deciding on a good execution
strategy for a given program based on the available functional and data parallelism in
the program. The framework is based on assumptions about the form of computation
and communication cost functions for multicomputer systems. We present mathematical
functions for these costs and show that these functions are realistic. The framework
also requires specification of the available functional and data parallelism for a given
problem. For this purpose, we have developed a graphical programming tool. Currently,
we have tested our approach using three benchmark programs on the Thinking
Machines CM-5 and Intel Paragon. Results presented show that the approach is very
effective and can provide a two- to three-fold increase in speedups over approaches
1  Introduction


Distributed memory multicomputers such as the IBM SP-1, the Intel Paragon and the
Thinking Machines CM-5 offer significant advantages over shared memory multiprocessors
in terms of cost and scalability. Unfortunately, to extract all that computational power from
these machines, users have to write efficient software for them, which is an extremely
laborious process. Numerous research efforts have proposed language extensions to FORTRAN
in order to ease programming multicomputers; the most prominent one has been the HPF
language standardization [1]. A number of compilers for HPF have been proposed; these
include the FORTRAN-D compiler from Rice University [2], the SUIF compiler from
Stanford [3], the PTRAN II compiler from IBM [4], the SUPERB compiler from the University
of Vienna [5], and the FORTRAN-90D/HPF compiler from Syracuse University [6].

The  PARADIGM  compiler  project  at  Illinois  is  aimed  at  devising  a  parallelizing  compiler 
for  distributed  memory  multicomputers  that  will  accept  FORTRAN  77  or  HPF  programs  as 
input.  The  fully  implemented  PARADIGM  compiler  will: 

•  Annotate  FORTRAN  77  programs  with  HPF  data  distribution  directives  [7,  8]. 

•  Partition  computations  and  generate  communication  for  HPF  programs  [9,  10,  11]. 

•  Exploit functional and data parallelism in HPF programs [12, 13, 14, 15].

•  Provide  runtime  support  for  irregular  computations  [16]. 

There has been a lot of interest in simultaneous exploitation of data and functional
parallelism. Research efforts in the area include the Fx compiler from CMU [17], the
FORTRAN-M compiler from Argonne National Lab [18], the work by Chapman et al. in
[19], the work by Cheung and Reeves in [20], the work by Girkar and Polychronopoulos in
[21], and the work by Ramaswamy and Banerjee in [22]. All these efforts recognize the
benefits of using both types of parallelism together to achieve better performance for certain
applications. In this paper, we discuss the framework to be used in the PARADIGM
compiler for exploiting functional and data parallelism together.

For our discussion, we define Functional Parallelism to be any parallelism existing among
the various routines of a given program, and Data Parallelism to be parallelism within a
routine that is obtained by distributing data among all processors involved and having them
each perform computation using the owner computes rule [2]. Matrix Add, Matrix Multiply
and 2D FFT are a few examples of what we mean by routines. To re-emphasize, the
definitions of functional and data parallelism above may not correspond to some of the other
popular connotations of these terms.

1.1  Macro  Dataflow  Graphs 

In order to expose the parallelism available in any given program, we use a representation
called the Macro Dataflow Graph (MDG). This representation has been used before by
researchers such as Sarkar in [23] and Prasanna and Agarwal in [24]. For our work, the
MDG representation for a program is defined to be a weighted directed acyclic graph whose
nodes correspond to routines of the program and whose edges correspond to precedence
constraints among these routines. There are two distinguished nodes in the MDG called
START and STOP. START precedes all other nodes and STOP succeeds all other nodes.

The weights of the nodes and edges are based on the concepts of Processing and Data
Transfer costs. The time required for the execution of a routine is called its processing
cost. Processing costs will depend on the number of processors used to execute the routine
and include all computation and communication costs incurred. On distributed memory
machines these costs will be dependent on the kind of data distribution used. Each routine
may need a particular distribution for each of its arrays to achieve the best performance.
Since precedence constraints mean that an array being read by a routine is being written
into by its predecessor, we may need to redistribute the array after the execution of the
predecessor routine. The time needed for this data redistribution between the execution of
a pair of routines is referred to as the data transfer cost for that pair. Data transfer costs
are made up of three components: a sending cost for processors at the sending routine, a
network cost, and a receiving cost for processors at the receiving routine. All these cost
components are functions of the number of processors used for the sending and receiving
routines.

We  consider  the  weight  of  a  node  in  the  MDG  to  be  composed  of: 

1.  The  receiving  cost  components  of  all  data  transfers  from  its  predecessors 

2.  The  processing  cost  of  the  routine  it  corresponds  to 

3.  The  sending  cost  components  of  all  data  transfers  to  its  successors 

The two distinguished nodes START and STOP do not perform any computation; they
therefore have zero weight.




Figure  1:  Example  Showing  Functional  Parallelism 


The  weight  of  an  edge  between  a  pair  of  nodes  in  the  MDG  is  taken  to  be  the  network 
cost  component  of  data  transfer  between  the  routines  corresponding  to  the  nodes. 
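The node and edge weight definitions above can be sketched in a few lines of Python. This is only an illustration: the `Transfer` record, the function names, and all numeric costs are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    """Cost components of one data transfer (hypothetical values, in seconds)."""
    send: float     # sending cost, paid by the processors of the sending routine
    network: float  # network cost, becomes the weight of the MDG edge
    recv: float     # receiving cost, paid by the processors of the receiving routine


def node_weight(processing, incoming, outgoing):
    """Receiving costs of incoming transfers + processing cost + sending costs of outgoing ones."""
    return (sum(t.recv for t in incoming)
            + processing
            + sum(t.send for t in outgoing))


def edge_weight(transfer):
    """Only the network component is charged to the edge."""
    return transfer.network


# A routine B that reads an array from A and writes an array read by C.
a_to_b = Transfer(send=0.2, network=0.5, recv=0.3)
b_to_c = Transfer(send=0.1, network=0.4, recv=0.2)
weight_b = node_weight(2.0, [a_to_b], [b_to_c])  # 0.3 + 2.0 + 0.1
```

Splitting a transfer this way charges the sending and receiving work to the processors that perform it, while the network delay is the only part that sits between the two routines.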

The usefulness of MDGs is that they can be used to decide on the strategy to be used to
minimize execution time of the given program on the target multicomputer. MDGs expose
functional and data parallelism in the program, allowing us to exploit both in an optimal
manner. Data parallelism information is implicit in the weight functions of the nodes and
functional parallelism is implicit in the precedence constraints among nodes. In order to
decide on a good execution strategy for a program, we use an Allocation and Scheduling
approach. Allocation decides on the number of processors to use for each node in the
MDG and scheduling decides on a scheme of execution for the allocated nodes on the target
multicomputer. Our work in this paper provides methods that allocate and schedule any
given MDG such that the finish time obtained is within a factor of the best finish time
theoretically obtainable.

1.2  Example 

The usefulness of good allocation and scheduling may not be clear at first sight. It can be
better appreciated by considering an example. Figure 1 shows an MDG with three nodes
N1, N2 and N3. Plotted alongside are the processing costs of the routines they correspond to
as a function of the number of processors used. For ease of understanding we assume there
are no data transfer costs between routines. By our definitions, the weights of the nodes in
this MDG would be the same as the corresponding processing costs and the weight of edges
would be 0. Now, given a system with 4 processors, there could be many ways in which we
can allocate and schedule the MDG. For instance, a naive scheme would be to execute the
nodes one after another on all 4 processors. In this case, we have an execution time of 15.6
seconds. On the other hand, a better way of executing the MDG would be to first execute
N1 on all 4 processors, then allocate 2 processors each to nodes N2 and N3 and execute them
concurrently. This way, the routines finish in 14.3 seconds. The two schemes are shown
pictorially in Figure 2. The first scheme is exploiting pure data parallelism, i.e., all routines
use 4 processors. The second scheme, on the other hand, is exploiting both functional and
data parallelism, i.e., routines 2 and 3 execute concurrently as well as use 2 processors each.
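The comparison between the two schemes can be reproduced with a small calculation. The cost table below is hypothetical (the actual curves are those of Figure 1 and are not reproduced here), and the sketch assumes that N2 and N3 both depend only on N1 and that data transfer costs are zero, as in the example.

```python
# Hypothetical processing costs in seconds, indexed by node and processor count.
# They are NOT the values behind Figure 1; they only mimic routines that scale poorly.
cost = {
    "N1": {4: 5.0},
    "N2": {2: 6.0, 4: 5.0},
    "N3": {2: 6.0, 4: 5.0},
}

# Scheme (a): pure data parallelism, every routine runs on all 4 processors in turn.
naive = cost["N1"][4] + cost["N2"][4] + cost["N3"][4]

# Scheme (b): N1 on 4 processors, then N2 and N3 side by side on 2 processors each.
mixed = cost["N1"][4] + max(cost["N2"][2], cost["N3"][2])

print(f"naive: {naive} s, mixed: {mixed} s")
```

With these made-up numbers, doubling the processors of N2 and N3 only shaves one second off each, so running them concurrently on half the machine is the better choice.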

Intuitively, good allocation and scheduling makes program execution faster because of
more efficient execution. Most real applications execute less efficiently as the size of a
processor system grows; the processing cost curves of Figure 1 in our example are
typical. We can see that by executing the nodes N2 and N3 concurrently and using fewer
processors for them, the second scheme improves overall efficiency over the first. This makes
the second scheme execute the program faster than the first.

A  point  of  interest  with  respect  to  the  type  of  code  generated  in  the  two  schemes  is  that 
the  first  scheme  will  essentially  have  each  processor  execute  similar  code  on  different  data 
sets.  We  refer  to  this  type  of  code  as  Single  Program  Multiple  Data  (SPMD).  On  the  other 
hand,  the  second  scheme  can  have  very  different  code  for  each  processor;  this  type  of  code  is 
called  Multiple  Program  Multiple  Data  (MPMD).  Therefore,  SPMD  code  exploits  only  data 
parallelism  while  MPMD  code  exploits  both  data  and  functional  parallelism. 

1.3  Allocation  and  Scheduling 

The basic problem of optimally scheduling a set of nodes with precedence constraints on a p
processor system when each node uses just one processor has been shown to be NP-complete
by Lenstra and Kan in [25]. Further treatment of this topic can also be found in the book
by Garey and Johnson [26]. The allocation and scheduling problem is considerably harder
than the one just described. There have been two major approaches to the approximate
solution of the allocation and scheduling problem. The first has been a bottom up approach
like those used by Sarkar in [23], and Gerasoulis and Yang in [27, 28]. A bottom up
approach considers the MDG to be made up of lightweight nodes (in terms of computation
requirements) with each using only one processor (an explicit allocation is not done). The
bottom up scheduling methods of [23, 27, 28] then use clustering on the nodes to form
larger nodes during the construction of a schedule. The second approach to allocation and
scheduling is a top down approach like the ones used by Prasanna and Agarwal in [24],
Belkhale and Banerjee in [12, 13], by Subhlok et al. in [17, 29, 30], by Ramaswamy and
Banerjee in [14, 15] and in this paper. Top down approaches start with the assumption of
heavyweight nodes (again, in terms of computation requirements) in the MDG and break
them down during the process of constructing an optimal schedule. Top down methods are
considered better in that they take a more global view of the problem than the bottom up
approaches. Therefore they may be able to perform better optimizations.

(a) Naive Scheme, 15.6 Secs. (b) Better Scheme, 14.3 Secs.

Figure 2: Allocation and Scheduling Schemes for Example

The difference between earlier top down approaches mentioned above and the work
presented here is significant. The methods presented in [24] do not consider data transfer costs
between nodes of the MDG. In addition, they make simplifying assumptions about the type
of MDGs handled and the processing cost model used. We do not make any assumptions for
our MDGs and use very realistic cost models. The work in [12, 13] also does not consider
the effects of non-zero data transfer costs. Their allocation and scheduling algorithms are
similar to the ones we use. The research presented in [17, 29, 30] considers allocation and
scheduling for a class of problems that process continuous streams of data sets. The
computation for each data set has a tree-structured MDG for all their benchmark programs [31].
A set of heuristics are used to decide on a good allocation and scheduling scheme. There is
no performance analysis of these heuristics and it is not clear how they would work for more
general MDGs (DAGs). Our methods, on the other hand, have been theoretically analyzed
for performance bounds and work well for general MDGs, as we will show.
Figure 3: Startup View of MAST


1.4  MAST  :  MDG  Allocation  and  Scheduling  Tool 

In order to provide an interface to our MDG allocation and scheduling methods, we designed
and implemented MAST. Some of the ideas used for MAST are similar to the ones used for
the HeNCE tool [32]. Basically, MAST provides users with the capability of specifying the
MDG representation for their programs in a graphical manner. Once an MDG has been
specified, MAST helps the user study the performance of his code on various architectures
and run the code if needed on any of those architectures.

MAST  has  three  major  components  to  it: 

1.  A graphical programming tool

2.  A library of parallel scientific routines whose execution is well profiled on all the desired
target multicomputers

3.  An  allocation  and  scheduling  tool  based  on  the  methods  discussed  in  this  paper 

MAST ties the three components together and provides the user with various utilities. A
step by step use of MAST is shown in Figures 3 through 6. We now explain the
significance of each figure:
Figure 4: View of MAST After Nodes Have Been Drawn


Figure 5: View of MAST After a Complete MDG Has Been Drawn

Figure 6: View of MAST After an Update Statistics


Figure  3  shows  MAST  when  it  is  started  up.  At  this  point  the  graphical  programming  tool 
on  the  left  half  of  MAST  has  only  two  nodes  -  START  and  STOP. 

Figure 4 shows MAST after the user has decided on the routines to be used in his program
and placed the required nodes. Nodes are drawn using one of the utilities of the
graphical programming tool that can be seen on the top left corner of the canvas. Once
a node has been drawn, it can be tied to a library routine using a pull down menu
provided (shown in figure). On closer inspection of the figure, the nodes can be seen to
have different routine names on them. It can also be seen that each node has little tag
boxes on top and bottom; these represent the input and output arrays for the routine
the node represents. Different routines have different numbers of tag boxes depending
on their input and output array counts.

Figure 5 shows MAST after the user has connected the nodes using the edge drawing
utility of the graphical programming tool. Edges connect an output tag box of a node
to an input tag box of another node. This indicates the array being written into by
the first node is being read by the second node. Output tag boxes may have multiple
edges, whereas input tag boxes can have only one edge.
Figure 6 shows MAST after the user has completed the MDG and has asked for a
performance evaluation on a specified target architecture of a specified size. This performance
evaluation uses the execution profile information of the scientific library. Performance
statistics provided include predicted uniprocessor time, SPMD time and MPMD time.
Speedup and efficiency predictions are also provided for the SPMD and MPMD cases;
also provided is a Gantt chart showing the allocation and scheduling used for the
MPMD case. In addition, MAST uses the allocation and schedule computed to generate
source code containing:

•  Calls  to  routines  in  the  scientific  library  provided  in  MAST. 

•  Routines generated for data redistribution - this is done using the work discussed
in [22]. In that paper, techniques for redistributing arrays (for regular
distributions) between arbitrary processor sets have been discussed.

•  Routines generated to enable the scientific routines and redistribution routines to
execute on subsets of processors. These routines are based on concepts similar to
those of groups, contexts and communicators in MPI [33].

This  generated  code  is  ready  to  be  compiled  and  executed  on  the  target  architecture. 

In contrast to our graphical programming approach, other researchers in the area of
integrating data and functional parallelism have used language extensions for specifying
available data and functional parallelism. The work in [17, 29, 30] on the Fx compiler is based
on extensions of FORTRAN which are used to specify functional and data parallelism. Data
parallelism is specified using constructs similar to HPF and functional parallelism is specified
using constructs called Parallel Sections. The FORTRAN-M language inherently provides
constructs for specifying functional parallelism [18]; recent work proposes to integrate the
language with HPF in order to specify data parallelism [34]. The work in [19] has proposed
extensions to FORTRAN-90 for specifying functional parallelism.

In the next section, we provide a brief overview of the theory of convex and posynomial
functions which we use in developing our allocation algorithm. In Sections 3 and 4 we
discuss our allocation and scheduling algorithms. We then present our processing and data
transfer cost models in Section 5. Theoretical results that discuss the optimality of our
algorithms are provided in Section 6. Section 7 provides preliminary results obtained using
our algorithms.
Figure 7: Convex sets. (Panels: Convex Set, Non-convex Set)

2  Theory  of  Convex  and  Posynomial  Functions 

In this section we provide a brief overview of the theory of convex and posynomial functions.
A detailed discussion of convex functions and convex programming can be found in
Luenberger's book [35]. A discussion of the theory of posynomial functions is provided by Ecker
in [36]. We have selected a few important and relevant points about these functions for our
discussion here.

2.1  Convex  Sets 

A set $C$ in $R^n$ is said to be convex if, for every $x_1, x_2 \in C$ and every real number $\alpha$,
$0 < \alpha < 1$, the point $\alpha x_1 + (1 - \alpha) x_2 \in C$.

This definition can be interpreted geometrically as stating that a set is convex if, given
two points in the set, every point on the line segment joining the two points is also a member
of the set. Examples of convex and nonconvex sets are shown in Figure 7.

2.2  Convex  Functions 

Definition: A function $f$ defined on a convex set $\Omega$ is said to be convex if, for every
$x_1, x_2 \in \Omega$ and every $\alpha$, $0 \le \alpha \le 1$,

$f(\alpha x_1 + (1 - \alpha) x_2) \le \alpha f(x_1) + (1 - \alpha) f(x_2)$.  (1)

$f$ is said to be strictly convex if the inequality in Equation (1) is strict for $0 < \alpha < 1$
and $x_1 \ne x_2$.

Geometrically, a function is convex if the line segment joining any two points on its graph
never lies below the graph. Examples of convex and nonconvex functions are shown in Figure 8.
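As a quick illustration of Equation (1), the following Python sketch tests the convexity inequality on randomly sampled points of a one-dimensional function. The function name and the sampling scheme are assumptions made for this example; a passed test is only evidence of convexity on the sampled interval, not a proof.

```python
import random


def is_convex_on_samples(f, lo, hi, trials=1000, seed=0, tol=1e-9):
    """Return False as soon as a sampled triple (x1, x2, alpha) violates Equation (1)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        alpha = rng.random()
        lhs = f(alpha * x1 + (1 - alpha) * x2)        # value on the graph
        rhs = alpha * f(x1) + (1 - alpha) * f(x2)     # value on the chord
        if lhs > rhs + tol:
            return False
    return True


print(is_convex_on_samples(lambda x: x * x, -5.0, 5.0))    # a convex function
print(is_convex_on_samples(lambda x: -x * x, -5.0, 5.0))   # a concave function
```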
Figure 8: Convex Functions. (Panels: Convex Function, Concave Function, Neither convex nor concave)

2.3  The  Convex  Programming  Problem 

The convex programming problem is stated as follows:

minimize $f(x)$  (2)

such that $x \in S$  (3)

where $f$ is a convex function and $S$ is a convex set.

This problem has the property that any local minimum of $f$ over $S$ is a global minimum,
thereby easing the optimization process since it is unnecessary to perform hill-climbing out
of local minima.

2.4  Posynomial  Functions 

A posynomial is a function $g$ of a positive variable $x \in R^n$ that has the form

$g(x) = \sum_j \gamma_j \prod_{i=1}^{n} x_i^{\alpha_{ij}}$  (4)

where the exponents $\alpha_{ij} \in R$ and the coefficients $\gamma_j > 0$. A posynomial is a function that is
similar to a polynomial, except that

-  The coefficients must be positive.

-  An exponent $\alpha_{ij}$ could be any real number, and not necessarily a non-negative integer,
unlike the case of polynomials.


A posynomial has the useful property that it can be mapped onto a convex function through
an elementary variable transformation [36]

$(x_i) = (e^{z_i})$  (5)

Such a functional form is very desirable, since such a transformation maps the problem of
minimizing a posynomial function under posynomial constraints to a convex programming
problem.
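The reason the transformation works can be written out in one line. In the sketch below, $G$ is a name introduced here for the transformed function and $z_i = \ln x_i$ is the new variable; neither symbol is taken from the paper.

```latex
g(x) = \sum_j \gamma_j \prod_{i=1}^{n} x_i^{\alpha_{ij}}
\quad \xrightarrow{\; x_i = e^{z_i} \;} \quad
G(z) = \sum_j \gamma_j \exp\Big( \sum_{i=1}^{n} \alpha_{ij} z_i \Big)
```

Each term of $G$ is the exponential of a linear function of $z$, scaled by a positive coefficient $\gamma_j$, and is therefore convex in $z$; a sum of convex functions is again convex. For instance, $g(x) = x^{1/2}$ is concave in $x$, but the transformed function $G(z) = e^{z/2}$ is convex in $z$.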

For example, the function $3.7 x_1^{1.4} x_2^{\sqrt{3}} + 1.8 x_1^{-1} x_3^{2.3}$ is a posynomial in the variables $x_1, x_2, x_3$.
A few other examples include:


$f(x_i) = 1 / x_i$  (6)

$f(x_i) = \text{constant}$  (7)

$f(x_i, x_j) = x_i / x_j$  (8)

$f(x_i) = x_i$  (9)

The  fact  that  these  functions  are  posynomials  will  be  used  later  in  the  paper. 


2.5  A  few  Properties  of  Convex  and  Posynomial  Functions 

If $f$ and $g$ are convex functions defined on a convex set $S$, then the following properties hold:

Sum Property The function $f + g$ is a convex function over $S$.

Constant Property The function $c \cdot f$, where $c$ is a non-negative constant, is a convex
function over $S$.

Max Property The function $\max(f, g)$ is a convex function over $S$.

Min Property The function $\min(f, g)$ is a convex function over $S$.

As shown before, posynomials can be transformed to convex functions using Equation
(5). Therefore, given two posynomial functions $h$ and $j$ defined on $S$, all the above properties
hold for the pair. We will use these properties later in the paper.
3  MDG  Allocation  Algorithm

We first consider the problem of allocation of processors to the nodes of a given MDG. After
the  allocation  is  carried  out  using  this  algorithm,  the  MDG  is  ready  to  be  scheduled  using 
the  algorithm  described  in  the  next  section. 

For  the  purposes  of  allocation  and  scheduling,  we  assume  the  given  MDG  has  n  nodes 
numbered  consecutively  from  1  to  n.  In  addition,  node  1  is  the  START  node  and  node  n  is 
the  STOP  node. 

To  obtain  an  optimum  solution  to  the  allocation  problem  for  a  given  MDG  and  a  given 
p  processor  target  system,  we  solve: 


minimize $\Phi$, where:

$\Phi = \max(A_p, C_p)$

$A_p = \frac{1}{p} \sum_{i=1}^{n} T_i \cdot p_i$

$C_p = y_n$

$y_i = \max_{m \in PRED_i} (y_m + t^D_{mi}) + T_i$

$T_i = \sum_{m \in PRED_i} t^R_{mi} + t^C_i + \sum_{n \in SUCC_i} t^S_{in}$


where


1.  $p_i$ represents the number of processors used by the $i$th node.

2.  $t^C_i$ is the processing cost of the routine corresponding to node $i$ and is a function of $p_i$.

3.  $t^R_{mi}$ represents the time required at node $i$ to process the messages it receives from
predecessor node $m$ (receiving cost component of data transfer). $t^D_{mi}$ represents the
network delay required between the completion of node $m$ and the start of node $i$
(network cost component of data transfer, weight of the edge between nodes $m$ and $i$).
$t^S_{in}$ represents the time required at node $i$ to process messages it sends to successor node
$n$ (sending cost component of data transfer). All these quantities are functions of $p_i$
and $p_j$, the processor counts of the two nodes involved.

4.  $PRED_i$ and $SUCC_i$ are the sets of predecessor and successor nodes of node $i$ in the
given MDG, respectively.

5.  $T_i$ is the total time required to process node $i$ (weight of node $i$).

6.  $y_i$ is the finish time of the $i$th node.

7.  $\Phi$ is the Optimum Finish Time obtainable for the execution of the program
corresponding to the given MDG.

8.  $A_p$ is called the Average Finish Time for the case when nodes use $p_i$ processors
each. To better understand the idea behind using the average finish time, consider a
quantity called processor-time area for a node. This is the product of the time taken for
executing a node and the number of processors it uses. If we sum the processor-time
areas for all nodes in the MDG, this will represent the minimum processor-time area
requirement for the MDG. Another way of saying the same is that $\Phi$ must be at least
the average finish time, which represents the sum of processor-time areas of all
the nodes in the MDG averaged over $p$.

9.  $C_p$ is called the Critical Path Time for the case when nodes use up to $p$ processors each.
Since the critical path is the longest path in the MDG, it represents the shortest possible
time in which we can finish executing the MDG. This implies $\Phi$ must be at least
the critical path time.

The free variables in this formulation are the $p_i$'s, with $1 \le p_i \le p$, $i = 1, \ldots, n$.
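To make the objective concrete, the sketch below evaluates the node weights, finish times, average finish time, critical path time and their maximum for one fixed allocation. It does not perform the minimization itself. The function name, the dictionary-based graph encoding and the example costs are all assumptions made for illustration; in particular, the costs are passed in already evaluated at the chosen allocation, and START and STOP are given one processor and zero cost.

```python
def evaluate_objective(order, preds, succs, alloc, p, proc, recv, net, send):
    """order: nodes in topological order, START first and STOP last.
    alloc[i]: processors of node i; proc[i]: processing cost at that allocation.
    recv[(m, i)], net[(m, i)], send[(m, i)]: cost components of the edge m -> i."""
    T, y = {}, {}
    for i in order:
        # Node weight: receiving costs + processing cost + sending costs.
        T[i] = (sum(recv[(m, i)] for m in preds[i]) + proc[i]
                + sum(send[(i, s)] for s in succs[i]))
        # Finish time: latest predecessor finish plus network delay, plus own weight.
        y[i] = max((y[m] + net[(m, i)] for m in preds[i]), default=0.0) + T[i]
    average = sum(T[i] * alloc[i] for i in order) / p   # processor-time area over p
    critical = y[order[-1]]                             # finish time of STOP
    return max(average, critical), average, critical


# Hypothetical MDG: 1 = START, 2 = N1, 3 = N2, 4 = N3, 5 = STOP, no transfer costs.
edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
zero = {e: 0.0 for e in edges}
preds = {1: [], 2: [1], 3: [2], 4: [2], 5: [3, 4]}
succs = {1: [2], 2: [3, 4], 3: [5], 4: [5], 5: []}
alloc = {1: 1, 2: 4, 3: 2, 4: 2, 5: 1}
proc = {1: 0.0, 2: 5.0, 3: 6.0, 4: 6.0, 5: 0.0}
phi, average, critical = evaluate_objective(
    [1, 2, 3, 4, 5], preds, succs, alloc, 4, proc, zero, zero, zero)
```

For this allocation the average finish time and the critical path time coincide, so no processor-time area is wasted and the bound is tight.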

Our  formulation  relies  on  the  properties  of  convex  functions  and  posynomial  functions 
discussed  in  the  previous  section.  Basically,  our  allocation  problem  is  equivalent  to  a  convex 
programming  formulation  if  the  following  conditions  hold: 

1.  $t^S_{ij}$, $t^D_{ij}$, $t^R_{ij}$, and $t^C_i$ can all be represented by posynomial functions of the free variables.

2.  $t^R_{ij} \cdot p_j$, $t^S_{ij} \cdot p_i$ and $t^C_i \cdot p_i$ are also posynomial functions of the free variables.

Later, in Section 5, we present cost functions to represent the quantities $t^S_{ij}$, $t^D_{ij}$, $t^R_{ij}$, and
$t^C_i$ which satisfy these conditions. We also demonstrate the practicality of these functions.
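To see what the two conditions ask for, consider a purely hypothetical processing cost with a serial part and a perfectly parallel part. This Amdahl-style form and the constants $a$ and $b$ are assumed here only for illustration; the cost functions actually used are those of Section 5.

```latex
t^C_i(p_i) = a + \frac{b}{p_i}, \quad a, b > 0
\qquad \Longrightarrow \qquad
t^C_i \cdot p_i = a\,p_i + b
```

Both expressions have positive coefficients and real exponents of $p_i$ (0 and $-1$ in the first, 1 and 0 in the second), so both are posynomials and the two conditions hold for this cost. A form such as $a - b\,p_i$ would not qualify, because one of its coefficients is negative.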

The discussion above implies that in practice, we can construct a formulation equivalent
to a convex programming formulation for allocation and, therefore, obtain a unique
minimum value for $\Phi$. The allocation that corresponds to this value will be an optimum
allocation for the given MDG. This method of allocation inherently assumes the existence
of a perfect scheduler, i.e., one that can produce a schedule which finishes the program in
$\Phi$ time units. In practice, producing such a schedule is an NP-complete problem [26]. We
therefore use a scheduler, described in the next section, which might produce a finish time
different from $\Phi$. As we will show in Section 6, we have quantified this deviation.

4  MDG  Scheduling  Algorithm 

To schedule a given MDG with processor allocation done according to the method described
in the previous section, we use an algorithm called the Prioritized Scheduling Algorithm
(PSA). The steps involved in the PSA are:

1.  The processor allocation produced by the convex programming formulation will be a
set of positive real numbers in the general case. However, we cannot allocate processors
in this manner on a real system. In this step we round off the allocated processors for
all the nodes to the nearest power of 2. This is done to make the final code generation
very easy. The results we obtain in Section 7 will show that this does not result in much
loss in practice. We refer to this step in the sections that follow as the rounding-off
step.

2.  The rounded-off processor allocation for the MDG is then modified to impose a bound
($PB$) on the number of processors used by any node. If the $i$th node uses $p_i$ processors
and $p_i > PB$, $p_i$ is reduced to $PB$, else it is left unchanged. It can be seen that $PB$
has to be a power of two or else we will have to round off again and that may lead to
a violation of the bound. The value of $PB$ to be used is determined using Theorem 3
which is discussed in Section 6. We refer to this step in the sections that follow as the
bounding step.

3.  Since the processor allocation for the MDG may have been changed from the value
produced by the allocation step, we need to re-compute the weights of the nodes and
the edges of the MDG based on the new allocation. Next, we place the node START
on a queue called the ready queue and mark its Earliest Start Time (EST) as 0.

4.  In this step, we pick a node from the ready queue that has the lowest possible EST.
We then check to see the time at which the processor requirement of this node can be
met, i.e., the time at which the required processors will be done with the node(s) they
are currently processing and can accept another node. This is called the Processor
Satisfaction Time (PST). If PST > EST, the node can be scheduled at PST; otherwise, it
can be scheduled only at EST. It must be noted that there will be some idle time in
the latter case since the required processors are available but not used. However, the
scheduler is not forcing idleness; it simply does not have any other node to schedule
since we have picked the node with the lowest EST.

5. If the node just scheduled is the STOP node, the scheduler terminates; otherwise, we
go to the next step.

6. After scheduling the node, we check whether any of its successors have all their
predecessors scheduled, i.e., have their precedence constraints satisfied. If so, we
compute the EST for those nodes based on the node and edge weights of the MDG and
the schedule built so far. Such nodes are then placed in the ready queue.

7. The procedure is repeated starting at Step 4.

The  finish  time  of  the  STOP  node  based  on  the  schedule  is  the  predicted  finish  time  of 
the  program. 
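
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in MAST: the names `round_and_bound` and `psa_schedule` are hypothetical, processors are treated as interchangeable (any k free processors can run a node that needs k), ties in the rounding step are broken downward, and dummy nodes such as START and STOP may be given zero processors.

```python
import heapq
import math


def round_and_bound(p_real, pb):
    """Steps 1-2: round a real-valued allocation to the nearest power of 2
    (ties rounded down, an assumption), then clamp it to the bound pb."""
    lower = 2 ** max(0, math.floor(math.log2(p_real)))
    upper = 2 * lower
    rounded = lower if p_real - lower <= upper - p_real else upper
    return min(rounded, pb)


def psa_schedule(procs, weight, edges, total_procs, start="START", stop="STOP"):
    """Steps 3-7. procs[n]: processors used by node n; weight[n]: its run time;
    edges[(u, v)]: delay on edge u -> v. Returns the finish time of each node."""
    succs = {n: [] for n in procs}
    preds = {n: [] for n in procs}
    for (u, v) in edges:
        succs[u].append(v)
        preds[v].append(u)

    free_at = [0.0] * total_procs   # time at which each processor becomes idle
    finish = {}                     # finish time of every scheduled node
    ready = [(0.0, start)]          # ready queue keyed by EST (Step 3)

    while ready:
        est, node = heapq.heappop(ready)            # Step 4: lowest EST first
        k = procs[node]
        free_at.sort()
        pst = free_at[k - 1] if k > 0 else 0.0      # k processors are idle from pst on
        begin = max(est, pst)
        finish[node] = begin + weight[node]
        for idx in range(k):                        # occupy the k earliest-free processors
            free_at[idx] = finish[node]
        if node == stop:                            # Step 5
            break
        for s in succs[node]:                       # Step 6
            if all(q in finish for q in preds[s]):
                s_est = max(finish[q] + edges[(q, s)] for q in preds[s])
                heapq.heappush(ready, (s_est, s))
    return finish
```

The value `finish["STOP"]` returned by this sketch plays the role of the predicted finish time of the program.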

The scheduling algorithm described above is a variant of the popular List Scheduling
Algorithm (LSA), which has been used, for example, by Liu in [37], by Garey, Graham and
Johnson in [38], by Wang and Cheng in [39], by Belkhale and Banerjee in [13], by Turek,
Wolf and Yu in [40], and by Ramaswamy and Banerjee in [14, 15]. Note that some of these
researchers also use variants of the LSA. We call our variant the PSA because of the
implicit prioritization in Step 4, where a node with the lowest EST is picked even
though other nodes may be ready for scheduling.

In the case where the number of processors used by any node is bounded, the PSA is
shown to be within a factor of the optimum in Theorem 1 in Section 6. While similar
results have been shown in the references mentioned above when there are no data transfer
costs, our result is unique in that it takes these costs into account. In fact, it is the
first such result to be derived.

5  Mathematical  Cost  Models 

This section deals with the important aspect of choosing appropriate functions to represent
the processing and data transfer costs involved in an MDG. The cost functions we choose
have to satisfy two criteria: they have to be convex or posynomial functions, and they
have to be practical. In this section we show that our cost functions are posynomials;
their suitability is shown in Section 7.

The processing cost function we use is an often used model. On the other hand, the data
transfer cost functions are our own. The derivation of these functions is described in detail
in [41].

Processing  Cost  Model 

For the processing cost model, we use Amdahl's law, i.e., the execution time of the routine
corresponding to the ith node ($t_i^P$) as a function of the number of processors it uses
($p_i$) is given by:

$$t_i^P = \left(\alpha_i + \frac{1 - \alpha_i}{p_i}\right) \cdot \tau_i \tag{10}$$

where $\tau_i$ is the execution time of the routine on a single processor and $\alpha_i$ is
the fraction of the routine that has to be executed serially. It can be seen that:

$$0 \le \alpha_i \le 1, \qquad 0 < \tau_i \tag{11}$$

We calculated $\alpha_i$ and $\tau_i$ for the various routines used in our benchmarks
by measuring execution times for these routines as a function of the number of
processors they use and then using linear regression to fit the measured values to a function
of the form we have chosen. In the future, we are considering the use of static techniques
to predict these values. At this point, we only wish to demonstrate that processing costs
can be modeled by a function of the form shown above. As our results will show, our form
models processing costs fairly accurately in practice.
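
The regression just described can be illustrated as follows. The sketch assumes the Amdahl form $t = \alpha \tau + (1 - \alpha) \tau / p$, which is linear in $x = 1/p$, so an ordinary least-squares line $t = a + b x$ yields $\tau = a + b$ and $\alpha = a / (a + b)$. The function name `fit_amdahl` is hypothetical, and the paper does not state which regression variant was actually used.

```python
def fit_amdahl(procs, times):
    """Least-squares fit of t = a + b * (1/p); returns (alpha, tau)."""
    xs = [1.0 / p for p in procs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    b = sxt / sxx              # slope     = (1 - alpha) * tau
    a = mean_t - b * mean_x    # intercept = alpha * tau
    tau = a + b                # single-processor execution time
    alpha = a / tau            # serial fraction
    return alpha, tau
```

At least two distinct processor counts are needed for the fit to be defined; with noisy measurements the fitted $\alpha$ should be checked against the range $0 \le \alpha \le 1$.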

Lemma 1 $t_i^P$ is a posynomial function w.r.t. $p_i$.

Proof: From Equation 10 we can see that $t_i^P$ is made of two components: a constant
component $\alpha_i \cdot \tau_i$ and a variable component $(1 - \alpha_i) \cdot \tau_i / p_i$.
The first component is a posynomial since it is a non-negative constant (under Equation 11).
The second component has a non-negative constant factor $(1 - \alpha_i) \cdot \tau_i$
multiplying a posynomial $1 / p_i$. Under the Constant Property discussed in Section 2,
this component is also a posynomial. Since both components are posynomials, using the
Sum Property of Section 2, $t_i^P$ is a posynomial. □

Lemma 2 $t_i^P \cdot p_i$ is a posynomial function w.r.t. $p_i$.

Proof: Using Equation 10, we can write down:

$$t_i^P \cdot p_i = \alpha_i \cdot \tau_i \cdot p_i + (1 - \alpha_i) \cdot \tau_i \tag{12}$$

We can see that this equation has two components. The variable component
$\alpha_i \cdot \tau_i \cdot p_i$ has a non-negative constant factor $\alpha_i \cdot \tau_i$
multiplying a posynomial $p_i$. By the Constant Property of Section 2, this component is a
posynomial. The other component in the equation above is a non-negative constant, which is
also a posynomial. Hence, using the Sum Property of Section 2, we see that $t_i^P \cdot p_i$
is a posynomial. □

We would like to point out that $\alpha$ and $\tau$ for a routine could depend on the size
of data input to the routine. This does not affect the statements made in either Lemma above.

Data Transfer Cost Model

Here we consider the cost of redistribution of an array of data elements between the
execution of two nodes of the MDG involving $p_i$ and $p_j$ processors at the sending and
receiving ends respectively. For modeling such a transfer, we assume that the array is
distributed evenly across the $p_i$ sending processors initially, and across the $p_j$
receiving processors finally. In addition, we assume that the number and sizes of messages
will be the same for each sending processor and for each receiving processor. For example,
every sending processor may send 3 messages of 1000 bytes and every receiver may receive
5 messages of 1500 bytes. These are both valid assumptions for the realm of regular
computations which we are dealing with.

The regular distributions of an array along any of its dimensions (of size $S$ along that
dimension) across a set of $p$ processors are classified into the following cases:

• ALL : All elements of the array along the dimension are owned by the same processor
($p = 1$).

• BLOCK : Elements of the array are distributed evenly across all the processors with
each processor owning a contiguous block of elements of size $S/p$.

• CYCLIC : Elements of the array are distributed evenly across all the processors in a
round robin fashion with each processor owning every pth element, the ith processor
starting at element i.

• BLOCKCYCLIC(X) : Elements of the array are distributed evenly across all the
processors in a round robin fashion with each processor owning every pth block of X
elements, the ith processor starting at the ith block of X elements.
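
To make the four cases concrete, the following sketch maps an element index to the processor that owns it. It is an illustration only: the function name `owner` is hypothetical, indices and processor ranks are 0-based, and for BLOCK the block size is taken as the ceiling of $S/p$ when $p$ does not divide $S$, which is an assumption the text does not address.

```python
def owner(index, size, p, dist, block=1):
    """Return the 0-based rank of the processor owning element `index`
    of a dimension of length `size` distributed over `p` processors."""
    if not 0 <= index < size:
        raise IndexError("index outside the array dimension")
    if dist == "ALL":
        return 0                       # a single processor owns everything
    if dist == "BLOCK":
        per_proc = -(-size // p)       # ceil(size / p) contiguous elements each
        return index // per_proc
    if dist == "CYCLIC":
        return index % p               # round robin, one element at a time
    if dist == "BLOCKCYCLIC":
        return (index // block) % p    # round robin, `block` elements at a time
    raise ValueError("unknown distribution: " + dist)
```

For example, with 8 elements on 4 processors, BLOCK gives the ownership pattern 0 0 1 1 2 2 3 3 while CYCLIC gives 0 1 2 3 0 1 2 3.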


Details of regular distributions can be found in [2, 1]. For our discussion of data transfer
costs, the distribution of an array can change from any of those listed above to any other
along one or more of its dimensions.

In considering costs for any type of array transfer from node i to node j, we have already
seen that there will be three basic components: a sending component $t_{ij}^S$, a network
component $t_{ij}^N$, and a receiving component $t_{ij}^R$. We have also seen that
$t_{ij}^S$ is accounted for in the weight of node i, $t_{ij}^N$ is taken to be the weight of
the edge joining node i and node j, and $t_{ij}^R$ is accounted for in the weight of node j.
The reason for doing this is that $t_{ij}^S$ and $t_{ij}^R$ require processor involvement,
whereas $t_{ij}^N$ does not.

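We propose the following expressions for the three cost components:

$$
\begin{aligned}
t_{ij}^{S} &= S_{ij}(p_i, p_j) \cdot t_{ss} + \frac{L}{p_i} \cdot t_{ps} \\
t_{ij}^{N} &= \frac{L}{p_i \cdot S_{ij}(p_i, p_j)} \cdot t_{n} \\
t_{ij}^{R} &= R_{ij}(p_i, p_j) \cdot t_{sr} + \frac{L}{p_j} \cdot t_{pr}
\end{aligned}
\tag{13}
$$

where:
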

• $L$ is the length (in bytes) of the array being transferred

• $t_{ss}$, $t_{ps}$ are the startup and per byte cost for sending messages from a processor

• $t_{n}$ is the network cost per message byte

• $t_{sr}$, $t_{pr}$ are the startup and per byte cost for receiving messages at a processor

• $S_{ij}$ is the number of messages sent from each sending processor

• $R_{ij}$ is the number of messages received at each receiving processor.

Intuitively, the sending component ($t_{ij}^S$) for each sending processor involves a startup
cost for each of the $S_{ij}$ messages sent and a processing cost for its share of the array
($L / p_i$). The same logic holds for the receiving component for receiving processors. The
network component represents the minimum delay required for messages to be delivered to the
receiving processors after they have been sent from the sending processors. If we assume a
pipelined network with no congestion effects, this delay will depend on the length of the last
message sent. By our assumption of equal sized messages, we see that the size of each message
will be $L / (p_i \cdot S_{ij})$. This is the reasoning behind the network cost component
expression shown.

Distribution      Block Factor
ALL               $L$
BLOCK             $L / p_i$
CYCLIC            $1$
BLOCKCYCLIC(X)    $X$

Table 1: Block Factors for Various Regular Distributions


Distribution      Skip Factor
ALL               [illegible in source]
BLOCK             [illegible in source]
CYCLIC            [illegible in source]
BLOCKCYCLIC(X)    [illegible in source]

Table 2: Skip Factors for Various Regular Distributions


It can be seen that the quantities $S_{ij}$ and $R_{ij}$ will depend on the kind of
redistribution occurring. It is possible to express these quantities in terms of a pair of
parameters of the sending and receiving distributions. The first of these parameters is called
the Block Factor (BF); it provides a measure of the sizes of the blocks of elements a processor
owns under any of the regular distributions. The Block Factor for the different regular
distributions of an array of L bytes on $p_i$ processors is shown in Table 1. The other
parameter we use is called the Skip Factor (SF); it provides an idea of the distance between
the successive blocks of elements a processor owns. We have listed the Skip Factors for the
various regular distributions of an array of L bytes on $p_i$ processors in Table 2. We now
write down the expressions for $S_{ij}$ and $R_{ij}$ as:


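$$
\begin{aligned}
S_{ij} &= \max\left(1, \frac{SF_j \cdot BF_i}{SF_i \cdot BF_j}, \frac{SF_j}{SF_i}, \frac{BF_i}{BF_j}\right) \\
R_{ij} &= \max\left(1, \frac{SF_i \cdot BF_j}{SF_j \cdot BF_i}, \frac{SF_i}{SF_j}, \frac{BF_j}{BF_i}\right)
\end{aligned}
\tag{14}
$$
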

where $BF_i$ and $SF_i$ are the Block Factor and Skip Factor for the sending distribution;
$BF_j$ and $SF_j$ are the Block Factor and Skip Factor for the receiving distribution.

In all the expressions above, we have omitted some details in order to make them more
understandable. First, we have considered only a one-dimensional array being transferred
in all the cost functions. In practice, arbitrary n-dimensional arrays may be redistributed.
In addition, the redistribution may not be confined to a single array; more than one array
may need to be redistributed between a pair of nodes, with the type of redistribution being
different for each of the arrays. It is easy to extend our functions to account for these
effects; we have not shown these extended forms as they are complex and lengthy. Our actual
implementation uses an extended form of these functions.
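
For the simplified one-dimensional case, the message counts and the three cost components can be evaluated with a short sketch such as the one below. It assumes that $S_{ij}$ is the maximum of 1, the ratio $SF_j / SF_i$, the ratio $BF_i / BF_j$ and their product (with the reciprocal ratios for $R_{ij}$), and that the send, network and receive costs take the forms $S_{ij} t_{ss} + (L/p_i) t_{ps}$, $L / (p_i S_{ij}) \cdot t_n$ and $R_{ij} t_{sr} + (L/p_j) t_{pr}$. The function names and every numeric parameter value used with them are hypothetical.

```python
def message_counts(bf_i, sf_i, bf_j, sf_j):
    """Messages per sender (S_ij) and per receiver (R_ij) from the Block and
    Skip Factors of the sending (i) and receiving (j) distributions."""
    sf_ratio = sf_j / sf_i
    bf_ratio = bf_i / bf_j
    s_ij = max(1.0, sf_ratio * bf_ratio, sf_ratio, bf_ratio)
    r_ij = max(1.0, 1.0 / (sf_ratio * bf_ratio), 1.0 / sf_ratio, 1.0 / bf_ratio)
    return s_ij, r_ij


def transfer_costs(length, p_i, p_j, s_ij, r_ij, t_ss, t_ps, t_n, t_sr, t_pr):
    """Send, network and receive cost of redistributing `length` bytes
    from p_i sending processors to p_j receiving processors."""
    send = s_ij * t_ss + (length / p_i) * t_ps      # startups plus per-byte send cost
    network = length / (p_i * s_ij) * t_n           # one message of L / (p_i * S_ij) bytes
    receive = r_ij * t_sr + (length / p_j) * t_pr   # startups plus per-byte receive cost
    return send, network, receive
```

For a BLOCK to BLOCK transfer of 8000 bytes from 4 to 8 processors, the Block Factors are 2000 and 1000; with equal Skip Factors (chosen here only for illustration) the sketch gives two messages per sender and one per receiver.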

Lemma 3 $t_{ij}^S$, $t_{ij}^N$, and $t_{ij}^R$ are posynomial functions w.r.t. $p_i$ and
$p_j$ for all possible cases of redistributions.

Proof: A complete proof would require us to show that the statement above is true for all
cases of redistributions. However, the lack of space prevents us from doing this; details can
be found in [41]. Instead, we show that the statement holds for a pair of cases:

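• Case A: BLOCK to BLOCK.

For this case, the expressions for $S_{ij}$ and $R_{ij}$ (using Tables 1 and 2 and
Equation 14) are given by:

$$
\begin{aligned}
S_{ij} &= \max\left(1, \frac{p_j}{p_i}\right) \\
R_{ij} &= \max\left(1, \frac{p_i}{p_j}\right)
\end{aligned}
\tag{15}
$$

Using these values in Equation 13, we obtain:

$$
\begin{aligned}
t_{ij}^{S} &= \max\left(1, \frac{p_j}{p_i}\right) \cdot t_{ss} + \frac{L}{p_i} \cdot t_{ps} \\
t_{ij}^{N} &= \frac{L}{p_i \cdot \max\left(1, \frac{p_j}{p_i}\right)} \cdot t_{n} \\
t_{ij}^{R} &= \max\left(1, \frac{p_i}{p_j}\right) \cdot t_{sr} + \frac{L}{p_j} \cdot t_{pr}
\end{aligned}
\tag{16}
$$

Proceeding in a manner similar to the one used for the processing cost function (using
the posynomial function examples and properties of posynomial functions), it can easily
be shown that $t_{ij}^S$, $t_{ij}^N$, and $t_{ij}^R$ are all posynomial functions w.r.t.
$p_i$ and $p_j$.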


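• Case B: BLOCKCYCLIC(X) to BLOCKCYCLIC(Y)

For this case, the expressions for $S_{ij}$ and $R_{ij}$ (using Tables 1 and 2 and
Equation 14) are given by:

$$
\begin{aligned}
S_{ij} &= \max\left(1, \frac{p_j \cdot X}{p_i \cdot Y}, \frac{X}{Y}, \frac{p_j}{p_i}\right) \\
R_{ij} &= \max\left(1, \frac{p_i \cdot Y}{p_j \cdot X}, \frac{Y}{X}, \frac{p_i}{p_j}\right)
\end{aligned}
\tag{17}
$$

Using these values in Equation 13, we obtain:

$$
\begin{aligned}
t_{ij}^{S} &= \max\left(1, \frac{p_j \cdot X}{p_i \cdot Y}, \frac{X}{Y}, \frac{p_j}{p_i}\right) \cdot t_{ss} + \frac{L}{p_i} \cdot t_{ps} \\
t_{ij}^{N} &= \frac{L}{p_i \cdot \max\left(1, \frac{p_j \cdot X}{p_i \cdot Y}, \frac{X}{Y}, \frac{p_j}{p_i}\right)} \cdot t_{n}
            = \min\left(\frac{L}{p_i}, \frac{L \cdot Y}{p_j \cdot X}, \frac{L \cdot Y}{p_i \cdot X}, \frac{L}{p_j}\right) \cdot t_{n} \\
t_{ij}^{R} &= \max\left(1, \frac{p_i \cdot Y}{p_j \cdot X}, \frac{Y}{X}, \frac{p_i}{p_j}\right) \cdot t_{sr} + \frac{L}{p_j} \cdot t_{pr}
\end{aligned}
\tag{18}
$$

Proceeding in a manner similar to the one used for the processing cost function (using
the posynomial function examples and properties of posynomial functions), it can easily
be shown that $t_{ij}^S$, $t_{ij}^N$, and $t_{ij}^R$ are all posynomial functions w.r.t.
$p_i$ and $p_j$.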


Lemma 4 $t_{ij}^S \cdot p_i$ and $t_{ij}^R \cdot p_j$ are posynomial functions w.r.t. $p_i$
and $p_j$ for all possible cases of redistributions.


Proof: A complete proof would require us to show that the statement above is true for all
cases of redistributions. However, the lack of space prevents us from doing this; details can
be found in [41]. Instead, we show that the statement holds for a pair of cases:

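• Case A: BLOCK to BLOCK.

As shown in the previous lemma, the expressions for $t_{ij}^S$ and $t_{ij}^R$ are given by:

$$
\begin{aligned}
t_{ij}^{S} &= \max\left(1, \frac{p_j}{p_i}\right) \cdot t_{ss} + \frac{L}{p_i} \cdot t_{ps} \\
t_{ij}^{R} &= \max\left(1, \frac{p_i}{p_j}\right) \cdot t_{sr} + \frac{L}{p_j} \cdot t_{pr}
\end{aligned}
\tag{19}
$$

We can now write down the expressions for $t_{ij}^S \cdot p_i$ and $t_{ij}^R \cdot p_j$ as:

$$
\begin{aligned}
t_{ij}^{S} \cdot p_i &= \max(p_i, p_j) \cdot t_{ss} + L \cdot t_{ps} \\
t_{ij}^{R} \cdot p_j &= \max(p_j, p_i) \cdot t_{sr} + L \cdot t_{pr}
\end{aligned}
\tag{20}
$$

Proceeding in a manner similar to the one used for the processing cost function (using
the posynomial function examples and properties of posynomial functions), it can easily
be shown that $t_{ij}^S \cdot p_i$ and $t_{ij}^R \cdot p_j$ are both posynomial functions
w.r.t. $p_i$ and $p_j$.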

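• Case B: BLOCKCYCLIC(X) to BLOCKCYCLIC(Y)

As shown in the previous lemma, the expressions for $t_{ij}^S$ and $t_{ij}^R$ are given by:

$$
\begin{aligned}
t_{ij}^{S} &= \max\left(1, \frac{p_j \cdot X}{p_i \cdot Y}, \frac{X}{Y}, \frac{p_j}{p_i}\right) \cdot t_{ss} + \frac{L}{p_i} \cdot t_{ps} \\
t_{ij}^{R} &= \max\left(1, \frac{p_i \cdot Y}{p_j \cdot X}, \frac{Y}{X}, \frac{p_i}{p_j}\right) \cdot t_{sr} + \frac{L}{p_j} \cdot t_{pr}
\end{aligned}
\tag{21}
$$

We can now write down the expressions for $t_{ij}^S \cdot p_i$ and $t_{ij}^R \cdot p_j$ as:

$$
\begin{aligned}
t_{ij}^{S} \cdot p_i &= \max\left(p_i, \frac{p_j \cdot X}{Y}, \frac{X \cdot p_i}{Y}, p_j\right) \cdot t_{ss} + L \cdot t_{ps} \\
t_{ij}^{R} \cdot p_j &= \max\left(p_j, \frac{p_i \cdot Y}{X}, \frac{Y \cdot p_j}{X}, p_i\right) \cdot t_{sr} + L \cdot t_{pr}
\end{aligned}
\tag{22}
$$

Proceeding in a manner similar to the one used for the processing cost function (using
the posynomial function examples and properties of posynomial functions), it can easily
be shown that $t_{ij}^S \cdot p_i$ and $t_{ij}^R \cdot p_j$ are both posynomial functions
w.r.t. $p_i$ and $p_j$.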

Having shown the statement of the Lemma true for the two example cases, we extend
this result to cover all the possible cases of redistribution. □

6  Optimality  of  the  Allocation  and  Scheduling  Method 

While developing the Allocation algorithm, we assumed the existence of a perfect scheduling
algorithm. Since the actual scheduling algorithm we use is not perfect, our methods may
not achieve the optimum value in practice. The theoretical results that follow quantify the
deviations of our algorithms from the best possible solution. In deriving these theorems, we
have assumed that the underlying computation and communication cost functions are of the
form discussed in the previous section.

We present below a definition of a term used in the proof of the theorem that follows.

Definition 1 Area of Useful Work

When a schedule $s$ is used for a given MDG on a given multicomputer system, the area of
useful work ($W_s$) done by it is defined as:

$$W_s = \sum_{i=1}^{m} \delta^i \cdot p^i \tag{23}$$

where $\delta^i$ is the ith interval during which a constant number ($p^i$) of processors
are kept busy by the schedule. The quantity $m$ denotes the total number of such intervals.

Theorem 1 Assume we are given an MDG with n nodes and a processor allocation such
that no node uses more than PB processors. Let $T_{psa}$ denote the value of the finish time
obtained by scheduling this MDG on a given p processor system using the PSA
and $T_{opt}^{PB}$ denote the value obtained using the best possible scheduler. The
relationship between these two quantities is given by:

$$T_{psa} \le \left(1 + \frac{p}{p - PB + 1}\right) \cdot T_{opt}^{PB} \tag{24}$$

Proof:

In the best case the area of useful work done by the optimal scheduling algorithm can
be $p \cdot T_{opt}^{PB}$. This is because it can, at best, keep all p processors in the
system busy for the entire length of the schedule it produces. If the work done by the PSA
is denoted by $W_{psa}$, we can write:

$$W_{psa} \le p \cdot T_{opt}^{PB} \tag{25}$$

If any node uses at most PB processors, we can say that the PSA being unable to
schedule the next node immediately means it has at least $p - PB + 1$ processors busy
currently. However, as we will see later, this will not always be true. If the duration when
this is not true is $\Delta$ (in the worst case), we can write (using the definition of
useful work):

$$W_{psa} \ge (T_{psa} - \Delta) \cdot (p - PB + 1) + W_{\Delta} \tag{26}$$

Here we are assuming $W_{\Delta}$ is the worst case useful work (if any) done during the
periods when fewer than $p - PB + 1$ processors are busy.

If PB or more processors are idle, it means the PSA has a case when
PST < EST for all the unscheduled nodes (refer to Section 4). This implies that every
other unexecuted node is dependent on the currently ongoing events, which may be a node
execution or an edge delay in progress. It is also clear that such a situation could occur
many times in the building up of the schedule.

Let us call a situation such as the one described above an Idling Situation (IS). We now
contend that one or more of the events involved in the ith such IS ($IS_i$) control each of
the events of every subsequent IS ($IS_j$ for all $j > i$). If this were not true, we could
find some node execution or edge delay in an $IS_k$, $k > i$, such that no event in $IS_i$
controls it. In such a case this node execution or edge delay would have been scheduled
concurrently with the events in $IS_i$, which means it cannot belong to $IS_k$, which is a
contradiction. Therefore our contention is true.

The implication of this dependence between events in ISs is that they must form a set
of paths (partial or complete) in the given MDG. We know that the length of any path in
the MDG is bounded by the length of the critical path. Therefore, in the worst case, we
can see that the total duration for which ISs can occur in the schedule is the length of the
critical path. Since $T_{opt}^{PB}$ must be at least the length of the critical path, we can
write:

$$\Delta \le T_{opt}^{PB} \tag{27}$$

It can be seen that in the worst case, no processors will be busy during any IS (all
events are edge delays), implying no work is done. This would give us $W_{\Delta} \ge 0$.
Using this inequality and Equation 27 in Equation 26, we have:

$$W_{psa} \ge (T_{psa} - T_{opt}^{PB}) \cdot (p - PB + 1) \tag{28}$$

From Equations 25 and 28, we have:

$$T_{psa} \le \left(1 + \frac{p}{p - PB + 1}\right) \cdot T_{opt}^{PB} \tag{29}$$

which is the required result □.


Theorem 2 In the first two steps of the PSA we modify the processor allocation produced
by the convex programming formulation of Section 3. If $T_{opt}^{PB}$ denotes the value of
the finish time obtained for the given MDG on a p processor system with this modified
allocation using the best possible scheduler, we have:

$$T_{opt}^{PB} \le [\text{factor illegible in source}] \cdot \Phi \tag{30}$$

where $\Phi$ is the solution obtained from the convex programming formulation.

Proof: We first look at the effect of increasing or decreasing the number of processors used
by the nodes of the MDG on the value of its average finish time and critical path time. This
can be seen from the definition of these quantities in Section 3 and the cost functions of
Section 5.

From this information, we can see that if we increase the allocation to any node i from
$p_i$ to $p_i'$, its contribution to the average can increase by a factor of no more than
[ratio illegible in source]. This factor comes about because of the startup component in
$t_{ij}^S$ and $t_{ij}^R$. On the other hand, it is also evident that decreasing the
processor allocation for any node will only decrease the value of the average.

Again, by looking closely at the material in the sections mentioned, we see that increasing
the allocation to any node i from $p_i$ to $p_i'$ will increase the critical path by a factor
no more than [ratio illegible in source]. This is because of the startup component in
$t_{ij}^S$ and $t_{ij}^R$. Similarly, decreasing the processor allocation of a node i from
$p_i$ to $p_i'$ could also increase the critical path. This time the factor may be up to
[ratio illegible in source]. This is because of the structure of $t_{ij}^N$.

Having seen this, we now examine the effect of the initial steps of the PSA on the values
of the average and critical path produced by the convex programming formulation ($A_p$ and
$C_p$).

In order to make our allocation practical, we first rounded off the processor allocation in
Step 1 of the PSA. Since we round off to the nearest power of 2, it can be shown that the
processor allocation for the ith node is changed by at most one third of its original value,
i.e., $p_i$ can decrease to $\frac{2}{3} p_i$ or increase to $\frac{4}{3} p_i$ in the worst
case. Let the value of the average finish time and critical path time of the MDG thus
allocated be denoted by $A_{RO}$ and $C_{RO}$ respectively. From the discussion on the effect
of increase or decrease of processor allocation, we can write:

$$A_{RO} \le [\text{factor illegible in source}] \cdot A_p \; ; \qquad C_{RO} \le [\text{factor illegible in source}] \cdot C_p \tag{31}$$

After performing the round-off, we imposed a bound on the number of processors used by
each node in Step 2. The value of PB we use is assumed to be a power of 2. If not, we would
have to round off again and might end up making some $p_i$ more than PB, which renders
the bound useless. The net effect of this step is a decrease in the processor allocation of
some nodes, and no change in the processor allocation of others. The worst case decrease for
any node is clearly from p to PB. If $A_{PB}$ is the value of the average finish time and
$C_{PB}$ is the value of the critical path time for this bounded allocation, using the
discussion on effects of processor increase or decrease, we have:

$$A_{PB} \le A_{RO} \; ; \qquad C_{PB} \le [\text{factor illegible in source}] \cdot C_{RO} \tag{32}$$

Since $T_{opt}^{PB}$ denotes the time obtained by using the best scheduler on this
rounded-off and bounded processor allocation, we can write:

$$T_{opt}^{PB} = \max(A_{PB}, C_{PB}) \tag{33}$$

Using Equations 31 and 32 in the equation above we have:

$$T_{opt}^{PB} \le [\text{bound illegible in source}] \tag{34}$$

From the equation above and the definition of $\Phi$ in Section 3, we have:

$$T_{opt}^{PB} \le [\text{bound illegible in source}] \tag{35}$$


which  is  the  required  result  □. 

Intuitively, this theorem summarizes the effect of our rounding-off and bounding steps.
It tells us how much the solution can deviate from the optimal even if we used the best
possible scheduler after having applied these steps. In the next theorem, we summarize all
effects, i.e., using the PSA to schedule after the round-off and bounding steps.


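Theorem 3 Let $T_{psa}$ denote the value of the finish time obtained for a processor
allocation using the convex programming formulation of Section 3 and the PSA. Then, we have:

$$T_{psa} \le \left(1 + \frac{p}{p - PB + 1}\right) \cdot [\text{factor of Theorem 2, illegible in source}] \cdot \Phi \tag{36}$$

where $\Phi$ is the solution obtained from the convex programming formulation.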


Proof: This result is a direct consequence of the previous theorems (1 and 2) □.

Corollary 1 The power of 2 that minimizes the value of the following expression is the
optimum value of PB to use for the PSA:

$$\left(1 + \frac{p}{p - PB + 1}\right) \cdot [\text{ratio illegible in source}]^{2} \cdot [\text{ratio illegible in source}]^{2} \tag{37}$$

Proof: From Theorem 3 it is clear that the expression to be minimized is the one given
above.

As we have discussed in Section 4, we must choose a PB that is a power of 2 or we
may end up with an infeasible solution. A feasible solution is one in which the processor
allocation for any node is a power of 2 as well as bounded by PB. Hence, the result □.


7  Implementation  and  Results 

The allocation and scheduling algorithms proposed above were tried out on three benchmark
MDGs. The MDGs were hand generated after studying the programs they correspond to
and are shown in Figure 9. Our testbed machines were a 128 node Thinking Machines CM-5
and a 128 node Intel Paragon.

The first MDG corresponds to multiplication of two complex matrices of 128 x 128 elements.
It has few nodes and is relatively simple. The second MDG corresponds to Strassen's
algorithm for multiplication of a pair of matrices of size 256 x 256 elements.
This is a more complex MDG with many more nodes than the previous one. The book by
Press et al. [42] describes Strassen's algorithm in detail and explains its usefulness. Our
third benchmark MDG corresponds to a Fourier-Chebyshev spectral Computational Fluid
Dynamics (CFD) algorithm applied on a 128 x 128 x 65 grid. Details of this algorithm can
be obtained from [43]. The important routines used in our benchmark MDGs are Matrix
Multiply, 2D FFT, Matrix Add, and Matrix Subtract.

Having obtained the MDGs, we used MAST to study their execution profiles using 32, 64
and 128 processors on both target architectures. MAST generated the SPMD and the MPMD
versions of code for all the benchmark MDGs so that we could compare the performance
obtained for the two cases. For the SPMD case, every node in the MDG uses all the
processors available; there are no data redistributions. For the MPMD case, we perform
allocation and scheduling using our methods; data redistributions may be needed. The
speedups and execution efficiencies obtained are shown in Figure 11 for the CM-5 and
in Figure 12 for the Paragon. From these figures it can be seen that speedups obtained
for the MPMD programs are much higher than those for the SPMD versions, especially for
larger systems. The only exception to this observation is the 32 processor case for the CFD
algorithm on the CM-5. Here, the SPMD version performs slightly better than the MPMD
version. This is because the data redistribution overhead for the MPMD program outweighs
the gains obtained by efficient execution of the routines. In all other cases this overhead
is more than amortized by the efficient execution of routines. The increased performance
benefits obtained for larger systems make allocation and scheduling critical for massively
parallel computing. Intuitively, the benefits of using functional and data parallelism together
will be greater when most of the available data parallelism in a routine has been exploited
(this happens for large systems).

Figure 10: Allocation and Scheduling of Complex Matrix Multiply

Machine    Benchmark                        Predicted Time (Secs)   Actual Time (Secs)   Error
CM-5       Complex Matrix Multiply          0.484                   0.442                +9.5 %
CM-5       Strassen's Matrix Multiply       0.758                   0.766                -1.0 %
CM-5       Computational Fluid Dynamics     0.467                   0.426                +9.6 %
Paragon    Complex Matrix Multiply          0.161                   0.187                -13.9 %
Paragon    Strassen's Matrix Multiply       0.288                   0.306                [illegible in source]
Paragon    Computational Fluid Dynamics     0.266                   0.244                +9.0 %

Table 3: Predicted versus Actual Execution Times of Benchmark Programs for 64 Processors

Another aspect of interest is the practicality of our models for processing and data transfer
costs. In order to check this we have listed the predicted and measured finish times of the
three benchmark programs for a system size of 64 nodes for both target architectures in
Table 3. The table shows the close correspondence of the two quantities (the prediction
errors are within about 14 % in magnitude), which indicates that our cost models are
practical.
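
The Error column of Table 3 appears to be the signed difference between predicted and actual time, relative to the actual time; this is an inference from the readable entries, not something the text states. Under that assumption the entries can be recomputed with the small sketch below (the function name is hypothetical).

```python
def prediction_error(predicted, actual):
    """Signed prediction error in percent, relative to the measured time."""
    return 100.0 * (predicted - actual) / actual


# Predicted and actual times (seconds) taken from Table 3.
cm5_complex_matmul = round(prediction_error(0.484, 0.442), 1)
paragon_complex_matmul = round(prediction_error(0.161, 0.187), 1)
paragon_strassen = round(prediction_error(0.288, 0.306), 1)
```

With this definition the readable entries of the table are reproduced, and the one Error entry that is unreadable in Table 3 (Strassen's Matrix Multiply on the Paragon) would come out at roughly -5.9 %.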

8  Conclusions  and  Future  Work 

In this paper we have presented a framework for exploiting data and functional parallelism
together. Basic to our framework is the MDG representation for a program. The MDG is
constructed using a graphical programming tool. Costs for its nodes and edges are estimated
using the cost function models provided. We then use an allocation and scheduling approach on
the MDG for exploiting functional and data parallelism together. Allocation is performed using
a convex programming approach and scheduling is done using a variant of list scheduling.
The performance of our allocation and scheduling approach has been theoretically analyzed;
practical results have also been provided which show it is very effective.

Figure 12: Speedup and Efficiency Comparison for SPMD and MPMD versions of Benchmark
Programs on the Intel Paragon

In  the  future  we  will  consider  the  following  extensions: 

•  Using  the  SPMD  compilation  techniques  developed  for  the  PARADIGM  compiler  [9, 
11,  10]  to  generate  data  parallel  versions  of  the  scientific  library  routines  in  MAST. 
Currently,  we  use  hand  coded  parallel  versions.  Using  the  SPMD  compilation  will  allow 
us  to  extend  the  scientific  library  in  an  easy  manner. 

• Development of static processing cost estimation techniques like the ones described in
[8, 44]. This way, we can profile any user specified routine and not confine the user to
using routines from the library provided in MAST.

• Minimization of redistribution costs by modifying the scheduling algorithm. Currently,
this algorithm does not take data locality into account; by using such information, it
may be able to avoid redistribution costs if the pair of nodes involved are executing on
the same set of processors and have the same data distributions.